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^ : ABSTRACT 

O 1 

We describe two parallel algorithms for line opacity calculations based on 
a local file and on a global file approach. The performance and scalability of 
both approaches is discussed for different test cases and very different paral- 
lel computing systems. The results show that a global file approach is more 
efficient on high-performance parallel supercomputers with dedicated paral- 
lel I/O subsystem whereas the local file approach is very useful on farms of 
workstations, e.g., cheap PC clusters. 



1. Introduction 

In the first 2 papers of this series (Hauschildt, Baron, & Allard 1997; Baron & Hauschildt 
1998, hereafter paper I and II, respectively) we have described general parallel algorithms 
that we have implemented in our multi-purpose stellar atmosphere code PHOENIX. These 
papers focused mainly on radiative transfer and NLTE problems and general paralleliza- 
tion issues. In this paper we discuss in greater depth the problem of line opacity cal- 
culations. This is in particular important if extensive line lists are used, i.e., molecular 
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line databases. These databases have increased dramatically over the last decade, mostly 
due to the work of Miller et al. (1994); Schryber, Miller, & Tennyson (1995); Partridge 
& Schwenke (1997) on water vapor lines and Jorgensen (1994); Schwenke (1998) on TiO 
lines. Currently our molecular line database contains about 550 million lines, most of 
which are TiO and H2O lines. Using these databases for opacity calculations poses a sig- 
nificant challenge both for the construction of opacity tables and for the construction of 
detailed model atmospheres. 

In our model atmosphere code we have implemented and used direct opacity sam- 
pling (dOS) for more than a decade with very good results. During that time, the size 
of the combined atomic and molecular line databases that we used has increased from a 
few MB to > 8 GB. Whereas the floating point and memory performance of computers 
has increased dramatically in this time, I/O performance has not kept up with this speed 
increase. Presently, the wallclock times used by the line selection and opacity modules are 
dominated by I/O time, not by floating point or overall CPU performance. Therefore, 1/ O 
performance is today more important that it was 10 years ago and has to be considered a 
major issue. The availability of large scale parallel supercomputers that have effectively 
replaced vector machines in the last 5 years, has opened up a number of opportunities for 
improvements of dOS algorithms. Parallel dOS algorithms with an emphasis on the han- 
dling of large molecular line databases are thus an important problem in computational 
stellar atmospheres. These algorithms have to be portable and should perform well for 
different types of parallel machines, from cheap PC clusters using Ethernet links to high 
performance parallel supercomputers. This goal is extremely hard to attain on all these 
different systems, and we, therefore, consider two different parallel dOS algorithms in 
this paper and compare their performance on two very different parallel machines. In 
the next sections we will discuss the direct opacity sampling method, describe the paral- 
lel algorithms in detail and then discuss the results of test calculations. We close with a 
summary and conclusions. 



2. Direct Opacity Sampling 

There are a number of methods in use to calculate line opacities. The classical meth- 
ods are statistical and construct tables that are subsequently used in the calculation. The 
Opacity Distribution Function (ODF) and its derivative the k-coefficient method have 
been used successfully in a number of atmosphere and opacity table codes (e.g., Kurucz 
1992). This method works well for opacity table and model construction but cannot be 
used to calculate detailed synthetic spectra. A second approach is the opacity sampling 
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(OS) method (e.g., Peytremann 1974). This is a statistical approach in which the line opac- 
ity is sampled on a fine grid of wavelength points using detailed line profiles for each 
individual spectral line. In classical OS implementation, tables of sampling opacities are 
constructed for given wavelengths grids and for different elements. These OS tables are 
then used to calculate model atmospheres and, e.g., Rosseland mean opacities. The OS 
method has the advantage that is more flexible than the ODF approach and it also allows 
the construction of (typically) low resolution synthetic spectra. The drawback of tables 
in general is, however, an inherent inflexibility in terms of, e.g., the wavelength grid or 
the tables's resolution. For example, to properly account for the pressure broadening of 
lines an opacity sampling table would have to be a function of temperature and gas pres- 
sure, which leads to very large tables if many wavelength points are tabulated (this is not 
a problem for ODF tables as the number of wavelength bins in such models is typically 
very small). In addition, a different code is typically required to calculate high-resolution 
spectra from the model atmosphere constructed with the ODF or OS tables, which has the 
potential of introducing systematic errors (e.g., if the atmosphere /table and the synthetic 
spectrum codes are not synchronized). 

In direct opacity sampling (dOS) these problems are avoided by replacing the tables 
with a direct calculation of the total line opacity at each wavelength point for all layers 
in a model atmosphere (e.g. Hauschildt & Baron 1999). In the dOS method the relevant 
lines (defined by a suitable criterion) are first selected from master spectral line databases 
which include all available lines. The line selection procedure will typically select more 
lines than can be stored in memory and thus temporary line database files are created 
during the line selection phase. The file size of the temporary database can vary, in the- 
ory, from zero to the size of the original database or larger, depending on the amount of 
data stored for the selected lines and their number. For large molecular line databases 
this can easily lead to temporary databases of several GB in size. This is in part due the 
storage for the temporary line database: its data are stored for quick retrieval rather than 
in the compressed space saving format of the master line databases. The number and 
identity of lines that are selected from the master databases depends on the physical con- 
ditions for which the line opacities are required (temperatures, pressures, abundances for 
a model atmosphere) and thus the line selection has to be repeated if the physical condi- 
tions change significantly. As an optimization, it is easily possible to include only lines 
in the temporary database that can be "seen" by the wavelength grid that will be used in 
the calculation of the line opacities later on. This is important if, for example, only a nar- 
row range in wavelength is considered at high resolution in order to generate a synthetic 
spectrum. 

The temporary line databases are used in the next phase to calculate the actual line 
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opacity for each wavelength point in a prescribed (arbitrary) wavelength grid. This makes 
it possible to utilize detailed line profiles for each considered spectral line on arbitrary 
wavelength grids. For each wavelength grid point, all (selected) lines within prescribed 
search windows (large enough to include all possibly important lines but small enough 
to avoid unnecessary calculations) are included in the line opacity calculations for this 
wavelength point. This procedure is thus very flexible, it can be used to calculate line 
opacities for both model atmosphere construction (with relatively few wavelength points) 
and for the generation of high-resolution synthetic spectra. Its main drawback is that 
the line selection and (in particular) line profile calculations are more costly than table 
interpolations. 



3. Parallel Algorithms 

There are currently a large number of significantly different types of parallel ma- 
chines in use, ranging from clusters of workstations or PCs to teraflop parallel super- 
computers. These systems have very different performance characteristics that need to 
be considered in parallel algorithm design. For the following discussion we assume this 
abstract picture of a general parallel machine: The parallel system consists of a number 
of processing elements (PEs), each of which is capable of executing the required parallel 
codes or sections of parallel code. Each PE has access to (local) memory and is able to 
communicate data with other PEs through a communication device. The PEs have access 
to both local and global filesystems for data storage. The local filesystem is private to each 
PE (and inaccessible to other PEs), and the global filesystem can be accessed by any PE. A 
PE can be a completely independent computer such as a PC or workstation (with single 
CPU, memory, and disk), or it can be a part of a shared memory multi-processor system. 
For the purposes of this paper, we assume that the parallel machine has both global and 
local logical filesystem storage available (possibly on the same physical device). The com- 
munication device could be realized, for example, by standard Ethernet, shared memory, 
or a special-purpose high speed communication network. 

In the following description of the 2 algorithms that we consider here we will make 
use of the following features of the line databases: 

• master line databases: 

1. are globally accessible to all PEs 

2. are are sorted in wavelength 
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3. can be accessed randomly in blocks of prescribed size (number of lines) 
• selected line (temporary) databases: 

1. the wavelength grid is known during line selection (not absolutely required 
but helpful) 

2. have to be sorted in wavelength 

3. can be accessed randomly in blocks of prescribed fixed size (number of lines) 

4. are stored either globally (one database for all PEs) or locally (one for each PE) 

5. are larger than the physical memory of the PEs 

3.1. Global Temporary Files (GTF) 

The first algorithm we describe relies on global temporary databases for the selected 
lines. This is the algorithm that was implemented in the versions of PHOENIX discussed in 
papers I and II. In the general case of N available PE's, the parallel line selection algorithm 
uses one PE dedicated to I/O and (N — 1) line selection PEs. The I/O PE receives data 
for the selected lines from the line selection PEs, assembles them into properly sorted 
blocks of selected lines, and writes them into the temporary database for later retrieval. 
The (N — 1) line selection PEs each read one block from a set of (N — 1) adjacent blocks of 
line data from the master database, select the relevant lines, and send the necessary data 
to the I/O PE. Each line selection PE will select a different number of lines, so the I/O 
PE has to perform administrative work to construct sorted blocks of selected lines that it 
then writes into the temporary database. The block sizes for the line selection PEs and for 
the temporary database created by the I/O PE do not have to be equal, but can be chosen 
for convenience. The blocks of the master line database are distributed to the (N — 1) line 
selection PEs in a round robin fashion. Statistically this results in a balanced load between 
the line selection PEs due to the physical properties of the line data. 

After the line selection phase is completed, the temporary global line database is used 
in the line opacity calculations. If each on the N PEs is calculating line opacities (poten- 
tially for different sets of wavelengths points or for different sets of physical conditions), 
they all access the temporary database simultaneously, reading blocks of line data as re- 
quired. In most cases of practical interest, the same block of line data will be accessed 
by several (all) PEs at roughly the same time. This can be advantageous or problematic, 
depending on the structure of the file system on which the database resides. The PEs 
also cache files locally (both through the operating system and in the code itself through 
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internal buffers) to reduce explicit disk I/O. Note that during the line selection phase 
the temporary database is a write-only file, whereas during the opacity calculations the 
temporary database is strictly read-only. 

The performance of the GTF algorithm depends strongly on the performance of the 
global file system used to store the temporary databases and on the characteristics of the 
individual PEs. This issue is discussed in more detail below. 



3.2. Local Temporary Files (LTF) 

The second algorithm we consider in this paper uses file systems that are local to 
each PE; such local file systems exist on many parallel machines, including most clusters 
of workstations. This algorithm is a new development discussed here for the first time 
that tries to take advantage of fast communication channels available on parallel comput- 
ers and utilized local disk space space for temporary line list files. This local disk space 
is frequently large enough for the temporary line database and may have high local I/O 
performance. In addition, I/O on the local disks of a PE does not require any inter-PE 
communication, whereas globally accessible filesystems often use the same communica- 
tion channel that explicit inter-PE communication uses. The latter can lead to network 
congestion if messages are exchanged simultaneously with global I/O operations. 

For the line selection, we could use the algorithm described above with the difference 
that the I/O PE would create a global (or local) database of selected lines. After the line 
selection is finished, the temporary database could then simply be distributed to all PEs, 
and stored on their local disk for subsequent use. However, this is likely to be slower in 
all cases than the GTF algorithm. 

Instead, we use a "ring" algorithm that creates the local databases directly. In partic- 
ular, each of the N PEs selects lines for one block from the master database (distributed 
in round robin fashion between the PEs). After the selection for this one block is com- 
plete, each PE sends the necessary data to it next neighbor; PE i sends its results to PE 
i + lmod N sends to PE 0) and, simultaneously, receives data from the previous PE in the 
ring. This can be easily realized using the MPI (Message Passing Interface Forum 1995) 
mpi_sendrecv call, which allows the simultaneous sending and receiving of data for each 
member of the ring. Each PE stores the data it receives into a buffer and the process is 
repeated until the all selected lines from the N blocks are buffered in all N PEs. The PEs 
then transfer the buffered line data into their local temporary databases. This cycle is re- 
peated until the line selection phase is complete. The line opacity calculations will then 
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proceed in the same way as outlined above, however, the temporary line databases are 
now local for each PE. 

This approach has the advantage that accessing the temporary databases does not 
incur any (indirect) Network File System (NFS) communication between the PEs as each 
of them has its own copy of the database. However, during the line selection phase a 
much larger amount of data has to be communicated over the network between the PEs 
because now each of them has to "know" all selected lines, not only the I/O PE used in 
the first algorithm. 

The key insight here is that low-cost parallel computers constructed out of commod- 
ity workstations typically have a very fast communication network (100 Mbs to 1 Gbs) but 
relatively slow NFS performance. This means that trading off the extra communication 
for fewer NFS disk accesses in the LTF algorithm is likely to give better performance. 



4. Results 

The performance of the GTF and LTF algorithms will depend strongly on the type of 
parallel machine used. A machine using NFS with fast local disks and communication is 
likely to perform better with the LTF algorithm. However, a system with fast (parallel) 
filesystem and fast communication can actually perform better with the GTF algorithm. 
In the following we will consider test cases run on very different machines: 

1. A cluster of Pentium Pro 200MHz PCs with 64MB RAM, SCSI disks, lOOMbs full- 
duplex Ethernet communication network, running Solaris 2.5.1. 

2. An IBM SP system with 200MHz Power3 CPUs, 512MB RAM per CPU, 133 MB/s 
switched communication network, 16 node IBM General Purpose File System (GPFS) 
parallel filesystem, running AIX 4.3. 

We have run 2 test cases to analyze the behavior of the different algorithms on differ- 
ent machines. The small test case was designed to execute on the PPro system. It uses a 
small line database (about 550MB) with about 35 million lines of which about 7.5 million 
lines are selected. The second test uses an about 16 times larger line database (about 9GB) 
and also selects 16 times more lines. This large test could not be run on the PPro system 
due to file size limitations and limited available disk space. The line opacity calculations 
were performed for about 21,000 wavelength points that are representative for typical cal- 
culations. The tests were designed for maximum 1/ O usage and are thus extreme cases. 
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In practical applications the observed scaling is comparable to or better than which found 
for these tests and appears to follow the results shown here rather well. In the following 
we will discuss the results for the tests on the different computing systems. 



5. PPro/Solaris system 

The results for the line selection procedure on the PPro/Solaris system are shown 
in Fig. 1. It is apparent from the figure that the GTF approach delivers higher relative 
speedups that translate into smaller execution times for more than 2 PEs. For serial (1PE) 
and 2PE parallel runs the LTF line selection is substantially faster than the GTF algorithm. 
The reason for this behavior can be explained by noting that the access of the global files is 
done through NFS mounts that use the same network as the MPI messages. Therefore, n — 
1 PEs request different data blocks from the NFS server (no process was run on the NFS 
server itself) and send their results to the I/O PE, which writes it out to the NFS server. 
In the LTF algorithm, each PE reads a different input block from the NFS server and 
then sends its results (around the ring) to all other PEs. Upon receiving data from its left 
neighbor, a PE writes it to local disk. This means that the amount of data streaming over 
the network can be as much as twice as high for the LTF compared to the GTF algorithm. 
This increases the execution time for the LTF approach if the network utilization is close 
to the maximum bandwidth. In this argument we have ignored the time required to write 
the data to local disks, which would make the situation worse for the LTF approach. 

The situation is very different for the calculation of the line opacities, c.f . Fig. 2. Now 
the LTF approach scales well (up to the maximum of 8 available machines) whereas the 
GTF algorithm hardly scales to more than 2 PEs. The absolute execution times for the 
LTF approach are up to a factor of 4 smaller (more typical are factors around 2) than the 
corresponding times for the GTF algorithm (the GTF run with 8PEs required roughly as 
much execution time as the LTF run with 1PE!). The reason for this is clearly the speed 
advantage of the local disk 1/ O compared to the NFS based I/O in the GTF code. If more 
PEs are used in the GTF line opacity approach, the network becomes saturated quickly 
and the PEs have to wait for their data (the NFS server itself was not the bottleneck). The 
LTF approach will be limited by the fact that as the number of PEs get larger, the efficiency 
of disk caching is reduced and more physical I/O operations are required. Eventually this 
will limit the scaling as the execution time is limited by physical 1/ O to local disks. 
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6. IBM SP system 

We ran the same (small) test on an IBM SP for comparison. The tests were run on a 
non-dedicated production system and thus timings are representative of standard opera- 
tion conditions and not optimum values. The global files were stored on IBM's General 
Purpose File System (GPFS), which is installed on a number of system-dedicated I/O 
nodes replacing the NFS fileserver used on the PPro /Solaris system. GPFS access is fa- 
cilitated through the same "switch" architecture that also carries MPI messages on the 
IBM SP. The results for the line selection code are shown in Fig. 3. For the small test 
the results are markedly different from the results for the PPro /Solaris system. The LTF 
algorithm performs significantly better for all tested configurations, however, scaling is 
very limited. The GTF code does not scale well at all for this small test on the IBM SP. 
This is due to the small size, so that the processing is so fast (nearly 100 times faster than 
on the PPro/Solaris system) so that, e.g., latencies and actual line selection calculations 
overwhelm the timing. The IBM SP has a very fast switched communications network 
that can easily handle the higher message traffic created by the LTF code. This explains 
why the LTF line selection executes faster and scales better for this small test on the IBM 
compared to the PPro/Solaris system. 

The line opacity part of the test shown in Fig. 4 performs distinctively different in 
the IBM SP compared to the PPro/Solaris system (cf. Fig. 2). In contrast to the latter, the 
IBM SP delivers better performance for the GTF algorithm compared to the LTF code. 
The scaling of the GTF code is also significantly better than that of the LTF approach. 
This surprising result is a consequence of the high I/O bandwidth of the GPFS running 
on many I/O nodes, the I/O bandwidth available to GPFS is significantly higher than 
the bandwidth of the local disks (including all filesystem overheads etc). The I/O nodes 
of the GPFS can also cache blocks in their memory which can eliminate physical I/O to 
a disk and replace access by data exchange over the IBM "switch". Note that the test 
was designed and run with parameters set to maximize actual 1/ O operations in order to 
explicitly test this property of the algorithms. 

The results of the large test case, for which the input file size is about 16 times bigger, 
are very different for the line selection, cf. Fig. 5. Now the GTF algorithm executes much 
faster (factor of 3) than the LTF code. This is probably caused by the larger I/O perfor- 
mance of the GPFS that can easily deliver the data to all nodes and the smaller number 
or messages that need to be exchanged in the GTF algorithm. The drop of performance 
at 32PE's in the GTF line selection run could have been caused by a temporary overload 
of the I/O subsystem (these tests were run on a non-dedicated machine). In contrast to 
the previous tests, the LTF approach does not scale in this case. This is likely caused by 
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the large number of relatively small messages that are exchanged by the PEs (the line list 
master database is the same as for the GTF approach, so it is read through the GPFS as 
in the GTF case). This could be improved, e.g., by choosing larger block sizes for data 
sent via MPI messages, however, this will have the drawback of more memory usage and 
larger messages are more likely to block than small messages that can be stored within 
the communication hardware (or driver) itself. 

The situation for the line opacities is shown in Fig. 6. The scaling is somewhat worse 
due to increased physical I/O (the temporary files are about 16 times larger as in the 
small test case). This is more problematic for the LTF approach which scales very poorly 
for larger numbers of PEs as the maximum local 1/ O bandwidth is reached far earlier than 
for the GTF approach. This is rather surprising as conventional wisdom would favor local 
disk 1/ O over global filesystem 1/ O on parallel machines. Although this is certainly true 
for farms of workstations or PCs, this is evidently not true on high-performance parallel 
computers with parallel I/O subsystems. 



7. Summary and Conclusions 

In this paper we have discussed two algorithms for parallel spectral line selection 
and opacity calculations useful for direct opacity sampling models of stellar atmospheres. 
The GTF algorithm uses global temporary files to store the list of selected lines, whereas 
the LTF algorithm stores the scratch files on local disks. These two methods show very 
different performance characteristics on different parallel systems. On a PC cluster, each 
processing element (PE) being a PC with its own local disk space and operating system 
networked with standard IP based Ethernet, the LTF algorithm has disadvantages in the 
line selection procedure for larger numbers of PEs (> 4), probably due to the higher de- 
mand put on the communication between nodes using the MPI library, but is slightly 
faster than the GTF approach if the number of PEs is smaller. However the LTF code pro- 
duces far faster and better scaling line opacity calculations, which will be the more costly 
part of a typical atmosphere model run (the line selection is usually required only once at 
the beginning of a model calculation). 

On the common IBM SP parallel supercomputers the situation changes significantly. 
On this machine, the small test runs so fast that the timing is dominated by side effects. In 
the large test case, the GTF line selection performs and scales far better than the LTF code. 
This surprising result is caused by the presence of a parallel filesystem (GPFS) on the IBM 
that dramatically improves performance of global I/O compared to local disk I/O. The 
GPFS also boosts the performance for the GTF code for the (in practical applications more 
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important) line opacity calculations, for both the small and the large test cases. 

Variations of the algorithms can be constructed, e.g., it is possible to store the mas- 
ter line database on each PE individually and thus totally remove global I/O to a single 
master line list (this requires enough local disk space to store both the master and tempo- 
rary databases). Other improvements are possible, e.g., optimization of the 1/ O blocksize 
for each type of machine. However, these optimizations are system dependent (and also 
depend on the load of the machine in general) and thus are not discussed here. 

The algorithms and the results show that parallel computing can lead to dramatic 
speed improvements in stellar atmosphere calculations but also that different algorithms 
are required for different types and capabilities of parallel machines. The speed improve- 
ments can then be used to develop physically more complex and detailed models (e.g., 
including massive NLTE calculations with line blanketing, models for M dwarfs with pos- 
sibly billions of molecular spectra lines or detailed models for stellar winds and for hot 
stars with radiative levitation modeling including a large number of elements and ioniza- 
tion stages). This approach sends us one step further to a better physical understanding 
of stars and their spectra. 

Both algorithms we have described here are applicable to a wide range of problems 
that have data requirements that are larger than the available memory and thus need to 
perform out-of-core calculations. They can be trivially adapted to any case in which a 
large amount of shared data has to be utilized by a number of processors simultaneously 
and where it is not easy to use, e.g., a domain-decomposition approach to allow each 
processor to use only a distinct, smaller subset of the data. If the exchange of data can be 
arranged in a ring-like topology and the communication network of the parallel computer 
used is fast, then the LTF algorithm should be efficient, however, if the machine has a 
fast parallel filesystem, then the GTF approach is both simpler to implement and more 
efficient. 
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8. Figures 




Fig. 1. — Results for the line selection process on the PPro/Solaris system. Times are given 
in wallclock seconds in the upper panel and as relative speedups for the lower panel. 
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Fig. 2. — Results for the line opacity calculations on the PPro/Solaris system. Times are 
given in wallclock seconds in the upper panel and as relative speedups for the lower 
panel. 
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Fig. 3. — Results for the line selection process on the IBM SP system. Times are given in 
wallclock seconds in the upper panel and as relative speedups for the lower panel. 
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Fig. 4. — Results for the line opacity calculations on the IBM SP system. Times are given 
in wallclock seconds in the upper panel and as relative speedups for the lower panel. 
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Fig. 5. — Results for the line selection process for the large test on the IBM SP system. 
Times are given in wallclock seconds in the upper panel and as relative speedups for the 
lower panel. 
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Fig. 6. — Results for the line opacity calculations for the large test on the IBM SP system. 
Times are given in wallclock seconds in the upper panel and as relative speedups for the 
lower panel. 



